Perceiver: General Perception with Iterative Attention

In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets.
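To make the scaling claim concrete, here is a minimal sketch of the paper's core idea: a small latent array cross-attends to a much larger input array, so the cost grows linearly with the number of inputs rather than quadratically. This is an illustration, not the code from the repository. The function names (`cross_attend`, `random_array`) and the sizes are hypothetical, and the learned query/key/value projections, latent self-attention layers, and position encodings are left out for brevity.

```python
import math
import random


def cross_attend(latents, inputs):
    """Single-head cross-attention: each latent row queries all input rows.

    Assumption for brevity: no learned query/key/value projections, so the
    latents act directly as queries and the inputs as both keys and values.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in latents:
        # One score per input element: O(M) per latent, O(N * M) in total.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in inputs]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Softmax-weighted average of the input rows.
        row = [
            sum(w * v[j] for w, v in zip(weights, inputs)) / total
            for j in range(dim)
        ]
        updated.append(row)
    return updated


def random_array(rows, cols, rng):
    """Hypothetical helper: a rows x cols array of Gaussian samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
inputs = random_array(1000, 8, rng)   # M = 1000 input elements (e.g. pixels)
latents = random_array(4, 8, rng)     # N = 4 latent vectors, with N << M
latents = cross_attend(latents, inputs)
print(len(latents), len(latents[0]))  # prints "4 8"
```

The output keeps the latent shape no matter how many input rows there are, which is why later processing can run on the small latent array alone.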

Modalities evaluated: images, point clouds, audio, video, and video+audio

Multimodal research (images and audio)

https://github.com/deepmind/deepmind-research/tree/master/perceiver